arXiv: 1503.01673v3 [stat.ML] 13 May 2016 


High Dimensional Bayesian Optimisation and Bandits via Additive Models 


Kirthevasan Kandasamy 
Jeff Schneider 
Barnabas Poczos 

Carnegie Mellon University, Pittsburgh, PA, USA 


KANDASAMY@CS.CMU.EDU 

SCHNEIDE@CS.CMU.EDU 

BAPOCZOS@CS.CMU.EDU 


Abstract 

Bayesian Optimisation (BO) is a technique used in optimising a
D-dimensional function which is typically expensive to evaluate. While
there have been many successes for BO in low dimensions, scaling it to
high dimensions has been notoriously difficult. Existing literature on
the topic is restricted to very limited settings. In this paper, we
identify two key challenges in this endeavour. We tackle these
challenges by assuming an additive structure for the function. This
setting is substantially more expressive and contains a richer class of
functions than previous work. We prove that, for additive functions, the
regret has only linear dependence on D even though the function depends
on all D dimensions. We also demonstrate several other statistical and
computational benefits in our framework. Via synthetic examples, a
scientific simulation and a face detection problem we demonstrate that
our method outperforms naive BO on additive functions and on several
examples where the function is not additive.

1. Introduction 

In many applications we are tasked with zeroth order optimisation of an
expensive to evaluate function f in D dimensions. Some examples are
hyperparameter tuning in expensive machine learning algorithms,
experiment design, optimising control strategies in complex systems, and
scientific simulation based studies. In such applications, f is a
blackbox which we can interact with only by querying for the value at a
specific point. Related to optimisation is the bandits problem arising
in applications such as online advertising and reinforcement learning.
Here the objective is to maximise the cumulative sum of all queries. In
either case, we need to find the optimum of f using as few queries as
possible by managing exploration and exploitation.

Bayesian Optimisation (Mockus & Mockus, 1991) refers to a suite of
methods that tackle this problem by modeling f as a Gaussian Process
(GP). In such methods, the challenge is twofold. At time step t, first
estimate the unknown f from the query-value pairs. Then use it to
intelligently query at $x_t$ where the function is likely to be high.
For this, we first use the posterior GP to construct an acquisition
function $\varphi_t$ which captures the value of the experiment at a
point. Then we maximise $\varphi_t$ to determine $x_t$.

Gaussian process bandits and Bayesian optimisation (GPB/BO) have been
successfully applied in many applications such as tuning hyperparameters
in learning algorithms (Snoek et al., 2012; Bergstra et al., 2011;
Mahendran et al., 2012), robotics (Lizotte et al., 2007; Martinez-Cantin
et al., 2007) and object tracking (Denil et al., 2012). However, all
such successes have been in low (typically < 10) dimensions (Wang et
al., 2013). Expensive high dimensional functions occur in several
problems in fields such as computer vision (Yamins et al., 2013),
antenna design (Hornby et al., 2006), computational astrophysics
(Parkinson et al., 2006) and biology (Gonzalez et al., 2014). Scaling
GPB/BO methods to high dimensions for practical problems has been
challenging. Even current theoretical results suggest that GPB/BO is
exponentially difficult in high dimensions without further assumptions
(Srinivas et al., 2010; Bull, 2011). To our knowledge, the only approach
to date has been to perform regular GPB/BO on a low dimensional
subspace. This works only under strong assumptions.

We identify two key challenges in scaling GPB/BO to high dimensions. The
first is the statistical challenge in estimating the function.
Nonparametric regression is inherently difficult in high dimensions,
with known lower bounds depending exponentially on the dimension (Gyorfi
et al., 2002). The often exponential sample complexity for regression is
invariably reflected in the regret bounds for GPB/BO. The second is the
computational challenge in maximising $\varphi_t$. Commonly used global
optimisation heuristics for maximising $\varphi_t$ themselves require
computation exponential in dimension. Any attempt to scale GPB/BO to
high dimensions must effectively address these two concerns.

In this work, we embark on this challenge by treating f as an additive
function of mutually exclusive lower dimensional components. Our
contributions in this work are:

1. We present the Add-GP-UCB algorithm for optimisation and bandits of
an additive function. An attractive property is that we use an
acquisition function which is easy to optimise in high dimensions.

2. In our theoretical analysis we bound the regret for Add-GP-UCB. We
show that it has only linear dependence on the dimension D when f is
additive.¹

3. Empirically we demonstrate that Add-GP-UCB outperforms naive BO on
synthetic experiments, an astrophysical simulator and the Viola and
Jones face detection problem. Furthermore Add-GP-UCB does well on
several examples when the function is not additive.

A Matlab implementation of our methods is available online at
github.com/kirthevasank/add-gp-bandits.

2. Related Work 

GPB/BO methods follow a family of GP based active learning methods which
select the next experiment based on the posterior (Osborne et al., 2012;
Ma et al., 2015; Kandasamy et al., 2015). In the GPB/BO setting, common
acquisition functions include expected improvement (Mockus, 1994),
probability of improvement (Jones et al., 1998), Thompson sampling
(Thompson, 1933) and upper confidence bound (Auer, 2003). Of particular
interest to us is the Gaussian process upper confidence bound (GP-UCB).
It was first proposed and analysed in the noisy setting by Srinivas et
al. (2010) and extended to the noiseless case by de Freitas et al.
(2012). Some literature studies variants, such as combining several
acquisition functions (Hoffman et al., 2011) and querying in batches
(Azimi et al., 2010).

To our knowledge, most literature for GPB/BO in high dimensions is in
the setting where the function varies only along a very low dimensional
subspace (Chen et al., 2012; Wang et al., 2013; Djolonga et al., 2013).
In these works, the authors do not encounter either challenge as they
perform GPB/BO in either a random or carefully selected lower
dimensional subspace. However, assuming that the problem is an easy (low
dimensional) one hiding in a high dimensional space is often too
restrictive. Indeed, our experimental results confirm that such methods
perform poorly on real applications when the assumptions are not met.
While our additive assumption is strong in its own right, it is
considerably more expressive. It is more general than the setting in
Chen et al. (2012). Even though it does not contain the settings in
Djolonga et al. (2013); Wang et al. (2013), unlike them, we still allow
the function to vary along the entire domain.

1 Post-publication it was pointed out to us that there was a bug in our
analysis. We are working on resolving it and will post an update
shortly. See Section 6 for more details.

Using an additive structure is standard in high dimensional regression
literature both in the GP framework and otherwise. Hastie & Tibshirani
(1990); Ravikumar et al. (2009) treat the function as a sum of one
dimensional components. Our additive framework is more general. Duvenaud
et al. (2011) assume a sum of functions of all combinations of lower
dimensional coordinates. This literature argues that using an additive
model has several advantages even if f is not additive. It is a well
understood notion in statistics that when we only have a few samples,
using a simpler model to fit our data may give us a better trade off for
estimation error against approximation error. This observation is
crucial: in many applications for Bayesian optimisation we are forced to
work in the low sample regime since calls to the blackbox are expensive.
Though the additive assumption is biased for nonadditive functions, it
enables us to do well with only a few samples. While we have developed
theoretical results only for additive f, empirically we show that our
additive model outperforms naive GPB/BO even when the underlying
function is not additive.

Analyses of GPB/BO methods focus on the query complexity of f which is
the dominating cost in relevant applications. It is usually assumed that
$\varphi_t$ can be maximised to arbitrary precision at negligible cost.
Common techniques to maximise $\varphi_t$ include grid search, Monte
Carlo and multistart methods (Brochu et al., 2010). In our work we use
the Dividing Rectangles (DiRect) algorithm of Jones et al. (1993). While
these methods are efficient in low dimensions they require exponential
computation in high dimensions. It is widely acknowledged in the
community that this is a critical bottleneck in scaling GPB/BO to high
dimensions (de Freitas, 2014). While we still work in the paradigm where
evaluating f is expensive and characterise our theoretical results in
terms of query complexity, we believe that assuming arbitrary
computational power to optimise $\varphi_t$ is too restrictive. For
instance, in hyperparameter tuning the budget for determining the next
experiment is dictated by the cost of the learning algorithm. In online
advertising and robotic reinforcement learning we need to act in under a
few seconds or in real time.

In this manuscript, Section 3 formally details our problem and
assumptions. We present Add-GP-UCB in Section 4 and our theoretical
results in Section 4.3. All proofs are deferred to Appendix B. We
summarise the regrets for Add-GP-UCB and GP-UCB in Table 1. In Section 5
we present the experiments.

Kernel                               | Squared Exponential                 | Matern
GP-UCB on $D^{\text{th}}$ order kernel | $\sqrt{D^{D+2}\, T (\log T)^{D+2}}$ | $2^D \sqrt{D}\, T^{\frac{\nu + D(D+1)}{2\nu + D(D+1)}} \log T$
Add-GP-UCB on additive kernel        | $\sqrt{d^d D^2\, T (\log T)^{d+2}}$ | $2^d D\, T^{\frac{\nu + d(d+1)}{2\nu + d(d+1)}} \log T$

Table 1. Comparison of Cumulative Regret for GP-UCB and Add-GP-UCB for the Squared Exponential and Matern kernels.


3. Problem Statement & Set up

We wish to maximise a function $f : \mathcal{X} \to \mathbb{R}$ where
$\mathcal{X}$ is a rectangular region in $\mathbb{R}^D$. We will assume
w.l.o.g. $\mathcal{X} = [0,1]^D$. f may be nonconvex and gradient
information is not available. We can interact with f only by querying at
some $x \in \mathcal{X}$ and obtaining a noisy observation
$y = f(x) + \epsilon$. Let an optimum point be
$x_* = \mathrm{argmax}_{x \in \mathcal{X}} f(x)$. Suppose at time t we
choose to query at $x_t$. Then we incur instantaneous regret
$r_t = f(x_*) - f(x_t)$. In the bandit setting, we are interested in the
cumulative regret
$R_T = \sum_{t=1}^T r_t = \sum_{t=1}^T \big(f(x_*) - f(x_t)\big)$, and
in the optimisation setting we are interested in the simple regret
$S_T = \min_{t \le T} r_t = f(x_*) - \max_{t \le T} f(x_t)$. For a
bandit algorithm, a desirable property is to have no regret:
$\lim_{T \to \infty} \frac{1}{T} R_T = 0$. Since
$S_T \le \frac{1}{T} R_T$, any such procedure is also a consistent
procedure for optimisation.

Key structural assumption: In order to make progress in high dimensions,
we will assume that f decomposes into the following additive form,

$$ f(x) = f^{(1)}(x^{(1)}) + f^{(2)}(x^{(2)}) + \dots + f^{(M)}(x^{(M)}). \tag{1} $$

Here each $x^{(j)} \in \mathcal{X}^{(j)} = [0,1]^{d_j}$ is a lower
dimensional component. We will refer to the $\mathcal{X}^{(j)}$'s as
"groups" and the grouping of different dimensions into these groups as
the "decomposition". The groups are disjoint - i.e. if we treat the
elements of the vector x as a set, $x^{(i)} \cap x^{(j)} = \emptyset$.
We are primarily interested in the case when D is very large and the
group dimensionality is bounded: $d_j \le d \ll D$. We have
$D \ge dM \ge \sum_j d_j$. Parenthesised superscripts index the groups
and a union over the groups denotes the reconstruction of the whole from
the groups (e.g. $x = \bigcup_j x^{(j)}$ and
$\mathcal{X} = \bigcup_j \mathcal{X}^{(j)}$). $x_t$ denotes the point
chosen by the algorithm for querying at time t. We will ignore
$\log D$ terms in $O(\cdot)$ notation. Our theoretical analysis assumes
that the decomposition is known but we also present a modified algorithm
to handle unknown decompositions and non-additive functions.

Some smoothness assumptions on f are warranted to make the problem
tractable. A standard in the Bayesian paradigm is to assume f is sampled
from a Gaussian Process (Rasmussen & Williams, 2006) with a covariance
kernel $\kappa : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ and that
$\epsilon \sim \mathcal{N}(0, \eta^2)$. Two commonly used kernels are
the squared exponential (SE) $\kappa_{\sigma,h}$ and the Matern
$\kappa_{\nu,h}$ kernels with parameters $(\sigma, h)$ and $(\nu, h)$
respectively. Writing $r = \|x - x'\|_2$, they are defined as

$$ \kappa_{\sigma,h}(x, x') = \sigma \exp\left( -\frac{r^2}{2h^2} \right), \tag{2} $$

$$ \kappa_{\nu,h}(x, x') = \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\sqrt{2\nu}\, r}{h} \right)^{\nu} B_\nu\left( \frac{\sqrt{2\nu}\, r}{h} \right). \tag{3} $$

Here $\Gamma$, $B_\nu$ are the Gamma and modified Bessel functions. A
principal convenience in modelling our problem via a GP is that
posterior distributions are analytically tractable.

In keeping with this, we will assume that each $f^{(j)}$ is sampled from
a GP, $\mathcal{GP}(\mu^{(j)}, \kappa^{(j)})$, where the $f^{(j)}$'s are
independent. Here, $\mu^{(j)} : \mathcal{X}^{(j)} \to \mathbb{R}$ is the
mean and
$\kappa^{(j)} : \mathcal{X}^{(j)} \times \mathcal{X}^{(j)} \to \mathbb{R}$
is the covariance for $f^{(j)}$. W.l.o.g. let $\mu^{(j)} = 0$ for all j.
This implies that f itself is sampled from a GP with an additive kernel
$\kappa(x, x') = \sum_j \kappa^{(j)}(x^{(j)}, x^{(j)\prime})$. We state
this formally for nonzero mean as we will need it for the ensuing
discussion.

Observation 1. Let f be defined as in Equation (1), where
$f^{(j)} \sim \mathcal{GP}(\mu^{(j)}(x), \kappa^{(j)}(x^{(j)}, x^{(j)\prime}))$.
Let $y = f(x) + \epsilon$ where $\epsilon \sim \mathcal{N}(0, \eta^2)$.
Denote $\delta(x, x') = 1$ if $x = x'$, and 0 otherwise. Then
$y \sim \mathcal{GP}(\mu(x), \kappa(x, x') + \eta^2 \delta(x, x'))$
where

$$ \mu(x) = \mu^{(1)}(x^{(1)}) + \dots + \mu^{(M)}(x^{(M)}), \tag{4} $$
$$ \kappa(x, x') = \kappa^{(1)}(x^{(1)}, x^{(1)\prime}) + \dots + \kappa^{(M)}(x^{(M)}, x^{(M)\prime}). $$

We will call a kernel such as $\kappa^{(j)}$ which acts only on d
variables a $d^{\text{th}}$ order kernel. A kernel which acts on all the
variables is a $D^{\text{th}}$ order kernel. Our kernel for f is a sum
of M at most $d^{\text{th}}$ order kernels which, we will show, is
statistically simpler than a $D^{\text{th}}$ order kernel.
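To make the additive kernel of Observation 1 concrete, the following is a minimal sketch in plain Python. It assumes SE kernels on every group; the decomposition, the test point and the parameter values are hypothetical and chosen only for illustration.

```python
import math

def se_kernel(u, v, scale=1.0, bandwidth=0.5):
    """Squared exponential kernel on one group: scale * exp(-r^2 / (2 h^2))."""
    r2 = sum((a - b) ** 2 for a, b in zip(u, v))
    return scale * math.exp(-r2 / (2.0 * bandwidth ** 2))

def restrict(x, group):
    """Pick out the coordinates of x that belong to one group."""
    return [x[i] for i in group]

def additive_kernel(x, x_prime, groups, **kw):
    """kappa(x, x') as a sum of per-group kernels over disjoint groups."""
    return sum(se_kernel(restrict(x, g), restrict(x_prime, g), **kw)
               for g in groups)

# Hypothetical decomposition: D = 6 coordinates, M = 3 groups of size d = 2.
groups = [[0, 3], [1, 4], [2, 5]]
x = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
x_shift = [0.9, 0.2, 0.3, 0.4, 0.5, 0.6]   # differs from x only in group 0

k_same = additive_kernel(x, x, groups)
k_shift = additive_kernel(x, x_shift, groups)
```

Moving a coordinate that belongs to a single group changes only that group's term in the sum, which is the structure the per-group inference in Section 4 relies on.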

We conclude this section by looking at some seemingly straightforward
approaches to tackle the problem. The first natural question is of
course why not directly run GP-UCB using the additive kernel? Since it
is simpler than a $D^{\text{th}}$ order kernel we can expect statistical
gains. While this is true, it still requires optimising $\varphi_t$ in D
dimensions to determine the next point, which is expensive.

Alternatively, for an additive function, we could adopt a sequential
approach where we use a 1/M fraction of our query budget to maximise the
first group by keeping the rest of the coordinates constant. Then we
proceed to the second group and so on. While optimising a d dimensional
acquisition function is easy, this approach is not desirable for several
reasons. First, it will not be an anytime algorithm as we will have to
pre-allocate our query budget to maximise each group. Once we proceed to
a new group we cannot come back and optimise an older one. Second, such
an approach places too much faith in the additive assumption. We will
only have explored M d-dimensional hyperplanes in the entire space.
Third, it is not suitable as a bandit algorithm as we suffer high regret
until we get to the last group. We further elaborate on the deficiencies
of this and other sequential approaches in Appendix A.2.

Figure 1. Illustration of the additive GP model for 2 observations where
M = 2 in (1). The squared variables are observed while the circled
variables are not. For brevity we have denoted
$f_i^{(j)} = f^{(j)}(x_i^{(j)})$ for i = 1, 2, *. We wish to infer the
posterior distributions of the individual GPs (outlined in blue).

4. Algorithm 

Under an additive assumption, our algorithm has two components. First,
we obtain the posterior GP for each $f^{(j)}$ using the query-value
pairs until time t. Then we maximise a d dimensional GP-UCB-like
acquisition function on each GP to construct the next query point. Since
the cost of optimising $\varphi_t$ grows exponentially with dimension,
this is cheaper than optimising one acquisition on the combined GP.


4.1. Inference on Additive GPs

Typically in GPs, given noisy labels $Y = \{y_1, \dots, y_n\}$ at points
$X = \{x_1, \dots, x_n\}$, we are interested in inferring the posterior
distribution for $f_* = f(x_*)$ at a new point $x_*$. In our case
though, we will be primarily interested in the distribution of
$f_*^{(j)} = f^{(j)}(x_*^{(j)})$ conditioned on X, Y. We have
illustrated this graphically in Figure 1. The joint distribution of
$f_*^{(j)}$ and Y can be written as

$$ \begin{pmatrix} f_*^{(j)} \\ Y \end{pmatrix} \sim \mathcal{N}\left( 0,\; \begin{pmatrix} \kappa^{(j)}(x_*^{(j)}, x_*^{(j)}) & \kappa^{(j)}(x_*^{(j)}, X^{(j)}) \\ \kappa^{(j)}(X^{(j)}, x_*^{(j)}) & \kappa(X, X) + \eta^2 I_n \end{pmatrix} \right). $$

The $p^{\text{th}}$ element of
$\kappa^{(j)}(X^{(j)}, x_*^{(j)}) \in \mathbb{R}^n$ is
$\kappa^{(j)}(x_p^{(j)}, x_*^{(j)})$ and the $(p, q)^{\text{th}}$
element of $\kappa(X, X) \in \mathbb{R}^{n \times n}$ is
$\kappa(x_p, x_q)$. We have used the fact
$\mathrm{Cov}(f_*^{(i)}, y_p) = \mathrm{Cov}\big(f_*^{(i)}, \sum_j f^{(j)}(x_p^{(j)}) + \epsilon\big) = \mathrm{Cov}\big(f_*^{(i)}, f^{(i)}(x_p^{(i)})\big) = \kappa^{(i)}(x_*^{(i)}, x_p^{(i)})$
as $f^{(j)} \perp f^{(i)}$, $\forall i \ne j$. By writing
$\Delta = \kappa(X, X) + \eta^2 I_n \in \mathbb{R}^{n \times n}$, the
posterior for $f_*^{(j)}$ is,

$$ f_*^{(j)} \mid x_*, X, Y \sim \mathcal{N}\Big( \kappa^{(j)}(x_*^{(j)}, X^{(j)}) \Delta^{-1} Y,\;\; \kappa^{(j)}(x_*^{(j)}, x_*^{(j)}) - \kappa^{(j)}(x_*^{(j)}, X^{(j)}) \Delta^{-1} \kappa^{(j)}(X^{(j)}, x_*^{(j)}) \Big). \tag{5} $$

4.2. The Add-GP-UCB Algorithm

In GPB/BO algorithms, at each time step t we maximise an acquisition
function $\varphi_t$ to determine the next point:
$x_t = \mathrm{argmax}_{x \in \mathcal{X}} \varphi_t(x)$. The
acquisition function is itself constructed using the posterior GP. The
GP-UCB acquisition function, which we focus on here, is

$$ \varphi_t(x) = \mu_{t-1}(x) + \beta_t^{1/2} \sigma_{t-1}(x). $$

Intuitively, the $\mu_{t-1}$ term in the GP-UCB objective prefers points
where f is known to be high, the $\sigma_{t-1}$ term prefers points
where we are uncertain about f and $\beta_t^{1/2}$ negotiates the
tradeoff. The former contributes to the "exploitation" facet of our
problem, in that we wish to have low instantaneous regret. The latter
contributes to the "exploration" facet since we also wish to query at
regions we do not know much about, lest we miss out on regions where f
is high. We provide a brief summary of GP-UCB and its theoretical
properties in Appendix A.1.

As we have noted before, maximising $\varphi_t$, which is typically
multimodal, to obtain $x_t$ is itself a difficult problem. In any grid
search or branch and bound method such as DiRect, maximising a function
to within $\zeta$ accuracy requires $O(\zeta^{-D})$ calls to
$\varphi_t$. Therefore, for large D maximising $\varphi_t$ is extremely
difficult. In practical settings, especially in situations where we are
computationally constrained, this poses serious limitations for GPB/BO
as we may not be able to optimise $\varphi_t$ to within a desired
accuracy.

Fortunately, in our setting we can be more efficient. We propose an
alternative acquisition function which applies to an additive kernel. We
define the Additive Gaussian Process Upper Confidence Bound (Add-GP-UCB)
to be

$$ \tilde{\varphi}_t(x) = \mu_{t-1}(x) + \beta_t^{1/2} \sum_{j=1}^{M} \sigma_{t-1}^{(j)}(x^{(j)}). \tag{6} $$

We immediately see that we can write $\tilde{\varphi}_t$ as a sum of
functions on orthogonal domains:
$\tilde{\varphi}_t(x) = \sum_j \tilde{\varphi}_t^{(j)}(x^{(j)})$ where
$\tilde{\varphi}_t^{(j)}(x^{(j)}) = \mu_{t-1}^{(j)}(x^{(j)}) + \beta_t^{1/2} \sigma_{t-1}^{(j)}(x^{(j)})$.
This means that $\tilde{\varphi}_t$ can be maximised by maximising each
$\tilde{\varphi}_t^{(j)}$ separately on $\mathcal{X}^{(j)}$. As we need
to solve M at most d dimensional optimisation problems, it requires only
$O(M^{d+1} \zeta^{-d})$ calls to the utility function in total - far
more favourable than maximising $\varphi_t$.
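The gap between the two call counts is easy to underestimate, so a small numerical illustration may help. The sketch below simply evaluates the two orders of growth, $\zeta^{-D}$ and $M^{d+1} \zeta^{-d}$, with all constants dropped; the values of $\zeta$, D and d are hypothetical.

```python
def calls_full(zeta, D):
    """Order of growth for one D-dimensional search: zeta^(-D)."""
    return zeta ** (-D)

def calls_additive(zeta, d, M):
    """Order of growth for M searches of dimension d: M^(d+1) * zeta^(-d)."""
    return M ** (d + 1) * zeta ** (-d)

# Hypothetical setting: accuracy 0.1, D = 20 split into M = 10 groups of d = 2.
zeta, D, d = 0.1, 20, 2
M = D // d
full = calls_full(zeta, D)
additive = calls_additive(zeta, d, M)
ratio = full / additive
```

These are orders of growth rather than actual evaluation counts, so only the ratio between them is meaningful.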

Since the cost for maximising the acquisition function is a key theme in
this paper let us delve into this a bit more. One call to $\varphi_t$
requires $O(Dt^2)$ effort. For $\tilde{\varphi}_t$ we need M calls each
requiring $O(d_j t^2)$ effort. So both $\varphi_t$ and
$\tilde{\varphi}_t$ require the same effort on this front. For
$\varphi_t$, we need to know the posterior for only f whereas for
$\tilde{\varphi}_t$ we need to know the posterior for each $f^{(j)}$.
However, the brunt of the work in obtaining the posterior is the
$O(t^3)$ effort in inverting the $t \times t$ matrix
$\Delta = \kappa(X, X) + \eta^2 I_t$ in (5), which needs to be done for
both $\varphi_t$ and $\tilde{\varphi}_t$. For $\tilde{\varphi}_t$, we
can obtain the inverse once and reuse it M times, so the cost of
obtaining the posterior is $O(t^3 + Mt^2)$. Since the number of queries
needed will be super linear in D and hence M, the $t^3$ term dominates.
Therefore obtaining each posterior $f^{(j)}$ is only marginally more
work than obtaining the posterior for f. Any difference here is easily
offset by the cost for maximising the acquisition function.

The question remains then if maximising $\tilde{\varphi}_t$ would result
in low regret. Since $\varphi_t$ and $\tilde{\varphi}_t$ are neither
equivalent nor have the same maximiser it is not immediately apparent
that this should work. Nonetheless, intuitively this seems like a
reasonable scheme since the $\sum_j \sigma_{t-1}^{(j)}$ term captures
some notion of the uncertainty and contributes to exploration. In
Theorem 5 we show that this intuition is reasonable - maximising
$\tilde{\varphi}_t$ achieves the same rates as $\varphi_t$ for
cumulative and simple regrets if the kernel is additive.

We summarise the resulting algorithm in Algorithm 1. In brief, at time
step t, we obtain the posterior distribution for $f^{(j)}$ and maximise
$\tilde{\varphi}_t^{(j)}$ to determine the coordinates $x_t^{(j)}$. We
do this for each j and then combine them to obtain $x_t$.

Algorithm 1 Add-GP-UCB

Input: Kernels $\kappa^{(1)}, \dots, \kappa^{(M)}$, Decomposition $(\mathcal{X}^{(j)})_{j=1}^{M}$

• $\mathcal{D}_0 \leftarrow \emptyset$,

• for j = 1, ..., M, $(\mu_0^{(j)}, \kappa_0^{(j)}) \leftarrow (0, \kappa^{(j)})$.

• for t = 1, 2, ...

1. for j = 1, ..., M,
   $x_t^{(j)} \leftarrow \mathrm{argmax}_{z \in \mathcal{X}^{(j)}} \; \mu_{t-1}^{(j)}(z) + \sqrt{\beta_t}\, \sigma_{t-1}^{(j)}(z)$

2. $x_t \leftarrow \bigcup_{j=1}^{M} x_t^{(j)}$.

3. $y_t \leftarrow$ Query f at $x_t$.

4. $\mathcal{D}_t = \mathcal{D}_{t-1} \cup \{(x_t, y_t)\}$.

5. Perform Bayesian posterior updates conditioned on $\mathcal{D}_t$ to
   obtain $\mu_t^{(j)}$, $\sigma_t^{(j)}$ for j = 1, ..., M.


4.3. Main Theoretical Results

Now, we present our main theoretical contributions. We bound the regret
for Add-GP-UCB under different kernels. Following Srinivas et al.
(2010), we first bound the statistical difficulty of the problem as
determined by the kernel. We show that under additive kernels the
problem is much easier than when using a full $D^{\text{th}}$ order
kernel. Next, we show that the Add-GP-UCB algorithm is able to exploit
the additive structure and obtain the same rates as GP-UCB. The
advantage of using Add-GP-UCB is that it is much easier to optimise the
acquisition function. For our analysis, we will need Assumption 2 and
Definition 3.

Assumption 2. Let f be sampled from a GP with kernel $\kappa$.
$\kappa(\cdot, x)$ is L-Lipschitz for all x. Further, the partial
derivatives of f satisfy the following high probability bound. There
exist constants a, b > 0 such that,

$$ \mathbb{P}\left( \sup_x \left| \frac{\partial f(x)}{\partial x_i} \right| > J \right) \le a e^{-(J/b)^2}. $$

The Lipschitzian condition is fairly mild and the latter condition holds
for four times differentiable stationary kernels such as the SE and
Matern kernels for $\nu > 2$ (Ghosal & Roy, 2006). Srinivas et al.
(2010) showed that the statistical difficulty of GPB/BO is determined by
the Maximum Information Gain as defined below. We bound this quantity
for additive SE and Matern kernels in Theorem 4. This is our first main
theorem.

Definition 3. (Maximum Information Gain) Let
$f \sim \mathcal{GP}(\mu, \kappa)$, $y_i = f(x_i) + \epsilon$ where
$\epsilon \sim \mathcal{N}(0, \eta^2)$. Let
$A = \{x_1, \dots, x_T\} \subset \mathcal{X}$ be a finite subset, $f_A$
denote the function values at these points and $y_A$ denote the noisy
observations. Let I be the Shannon Mutual Information. The Maximum
Information Gain between $y_A$ and $f_A$ is

$$ \gamma_T = \max_{A \subset \mathcal{X},\, |A| = T} I(y_A; f_A). $$

Theorem 4. Assume that the kernel $\kappa$ has the additive form of (4),
and that each $\kappa^{(j)}$ satisfies Assumption 2. W.l.o.g. assume
$\kappa(x, x) = 1$. Then,

1. If each $\kappa^{(j)}$ is a $d_j^{\text{th}}$ order squared
exponential kernel (2) where $d_j \le d$, then
$\gamma_T \in O(D d^d (\log T)^{d+1})$.

2. If each $\kappa^{(j)}$ is a $d_j^{\text{th}}$ order Matern kernel (3)
where $d_j \le d$ and $\nu > 2$, then
$\gamma_T \in O\big(D 2^d\, T^{\frac{d(d+1)}{2\nu + d(d+1)}} \log(T)\big)$.

We use bounds on the eigenvalues of the SE and Matern kernels from
Seeger et al. (2008) and a result from Srinivas et al. (2010) which
bounds the information gain via the eigendecay of the kernel. We bound
the eigendecay of the sum $\kappa$ via M and the eigendecay of a single
$\kappa^{(j)}$. The complete proof is given in Appendix B.1. The
important observation is that the dependence on D is linear for an
additive kernel. In contrast, for a $D^{\text{th}}$ order kernel this is
exponential (Srinivas et al., 2010).

Next, we present our second main theorem which bounds the regret for
Add-GP-UCB for an additive kernel as given in Equation 4.

Theorem 5. Suppose f is constructed by sampling
$f^{(j)} \sim \mathcal{GP}(0, \kappa^{(j)})$ for j = 1, ..., M and then
adding them. Let all kernels $\kappa^{(j)}$ satisfy Assumption 2 for
some L, a, b. Further, we maximise the acquisition function
$\tilde{\varphi}_t$ to within $\zeta_0 t^{-1/2}$ accuracy at time step
t. Pick $\delta \in (0, 1)$ and choose

$$ \beta_t = 2 \log\left( \frac{M \pi^2 t^2}{2\delta} \right) + 2d \log(D t^3) \in O(d \log t). $$

Then, Add-GP-UCB attains cumulative regret
$R_T \in O\big(\sqrt{D \gamma_T T \log T}\big)$ and hence simple regret
$S_T \in O\big(\sqrt{D \gamma_T \log T / T}\big)$. Precisely, with
probability $> 1 - \delta$,

$$ \forall T \ge 1, \quad R_T \le \sqrt{8 C_1 \beta_T M T \gamma_T} + 2 \zeta_0 \sqrt{T} + C_2, $$

where $C_1 = 1/\log(1 + \eta^{-2})$ and $C_2$ is a constant depending on
a, b, D, $\delta$, L and $\eta$.


Part of our proof uses ideas from Srinivas et al. (2010). We show that
the exploration term
$\beta_t^{1/2} \sum_j \sigma_{t-1}^{(j)}(\cdot)$ of (6) forms a credible
interval for $f(\cdot)$ about the posterior mean $\mu_t(\cdot)$ for an
additive kernel in Add-GP-UCB. We relate the regret to this confidence
set using a covering argument. We also show that our regret does not
suffer severely if we only approximately optimise the acquisition,
provided that the accuracy improves at rate $O(t^{-1/2})$. For this we
establish smoothness of the posterior mean. The correctness of the
algorithm follows from the fact that Add-GP-UCB can be maximised by
individually maximising $\tilde{\varphi}_t^{(j)}$ on each
$\mathcal{X}^{(j)}$. The complete proof is given in Appendix B.2. When
we combine the results in Theorems 4 and 5 we obtain the rates given in
Table 1.²


One could consider alternative lower order kernels - one candidate is
the sum of all possible $d^{\text{th}}$ order kernels (Duvenaud et al.,
2011). Such a kernel would arguably allow us to represent a larger class
of functions than our kernel in (4). If, for instance, we choose each of
them to be a SE kernel, then it can be shown that
$\gamma_T \in O(D^d d^{d+1} (\log T)^{d+1})$. Even though this is worse
than our kernel in poly(D) factors, it is still substantially better
than using a $D^{\text{th}}$ order kernel. However, maximising the
corresponding utility function, either of the form $\varphi_t$ or
$\tilde{\varphi}_t$, is still a D dimensional problem. We reiterate that
what renders our algorithm attractive in large D is not just the
statistical gains due to the simpler kernel. It is also the fact that
our acquisition function can be efficiently maximised.


4.4. Practical Considerations 

Our practical implementation differs from our theoretical 
analysis in the following aspects. 

2 See Footnote 1. 


Choice of $\beta_t$: $\beta_t$ as specified by Theorem 5 usually tends
to be conservative in practice (Srinivas et al., 2010). For good
empirical performance a more aggressive strategy is required. In our
experiments, we set $\beta_t = 0.2\, d \log(2t)$ which offered a good
tradeoff between exploration and exploitation. Note that this captures
the correct dependence on D, d and t in Theorems 5 and 6.
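To see how much more aggressive the practical choice is, the two schedules can be compared numerically. The sketch below uses the expression for $\beta_t$ as we read it from Theorem 5 together with the experimental setting above; the values of d, D, M and $\delta$ are hypothetical.

```python
import math

def beta_theory(t, d, D, M, delta=0.05):
    """beta_t = 2 log(M pi^2 t^2 / (2 delta)) + 2 d log(D t^3), read from Theorem 5."""
    return (2 * math.log(M * math.pi ** 2 * t ** 2 / (2 * delta))
            + 2 * d * math.log(D * t ** 3))

def beta_practical(t, d):
    """beta_t = 0.2 d log(2 t), the setting used in the experiments."""
    return 0.2 * d * math.log(2 * t)

# Hypothetical problem: D = 10 coordinates in M = 5 groups of size d = 2.
d, D, M = 2, 10, 5
schedule = [(t, beta_theory(t, d, D, M), beta_practical(t, d))
            for t in (1, 10, 100, 1000)]
```

Both schedules grow like $d \log t$, but the practical one has far smaller constants, so the exploration bonus is correspondingly smaller at every step.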

Data dependent prior: Our analysis assumes that we know the GP kernel of
the prior. In reality this is rarely the case. In our experiments, we
choose the hyperparameters of the kernel by maximising the GP marginal
likelihood (Rasmussen & Williams, 2006) every $N_{cyc}$ iterations.

Initialisation: Marginal likelihood based kernel tuning can be
unreliable with few data points. This is a problem in the first few
iterations. Following the recommendations in Bull (2011) we initialise
Add-GP-UCB (and GP-UCB) using $N_{init}$ points selected uniformly at
random.

Decomposition & Non-additive functions: If f is additive and the
decomposition is known, we use it directly. But it may not always be
known or f may not be additive. Then, we could treat the decomposition
as a hyperparameter of the additive kernel and maximise the marginal
likelihood w.r.t. the decomposition. However, given that there are
$D!/(d!^M M!)$ possible decompositions, computing the marginal
likelihood for all of them is infeasible. We circumvent this issue by
randomly selecting a few ($O(D)$) decompositions and choosing the one
with the largest marginal likelihood. Intuitively, if the function is
not additive, with such a "partial maximisation" we can hope to capture
some existing marginal structure in f. At the same time, even an
exhaustive maximisation will not do much better than a partial
maximisation if there is no additive structure. Empirically, we found
that partially optimising for the decomposition performed slightly
better than using a fixed decomposition or a random decomposition at
each step. We incorporate this procedure for finding an appropriate
decomposition as part of the kernel hyperparameter learning procedure
every $N_{cyc}$ iterations.
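The partial maximisation over decompositions can be sketched as follows. The count implements $D!/(d!^M M!)$ for the case D = dM; the `score` argument stands in for the GP marginal likelihood and is a hypothetical placeholder, as is the toy scoring rule used in the example.

```python
import math
import random

def num_decompositions(D, d, M):
    """D! / (d!^M M!): ways to split D = d*M coordinates into M unordered groups."""
    assert D == d * M
    return math.factorial(D) // (math.factorial(d) ** M * math.factorial(M))

def random_decomposition(D, d, M, rng):
    """Shuffle the coordinates and cut them into M groups of size d."""
    perm = list(range(D))
    rng.shuffle(perm)
    return [sorted(perm[j * d:(j + 1) * d]) for j in range(M)]

def pick_decomposition(D, d, M, score, n_candidates, seed=0):
    """Keep the best of a few random decompositions under a supplied score."""
    rng = random.Random(seed)
    candidates = [random_decomposition(D, d, M, rng) for _ in range(n_candidates)]
    return max(candidates, key=score)

# Toy stand-in for the marginal likelihood: count correctly recovered groups.
true_groups = {(0, 1), (2, 3), (4, 5)}
score = lambda groups: sum(tuple(g) in true_groups for g in groups)
best = pick_decomposition(6, 2, 3, score, n_candidates=6)
total = num_decompositions(10, 2, 5)
```

Even for D = 10 with groups of size 2 there are already hundreds of decompositions, which is why only a handful of random candidates are scored.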

How do we choose (d, M) when f is not additive? If d is large we allow
for a richer class of functions, but risk high variance. For small d,
the kernel is too simple and we have high bias but low variance -
further, optimising $\tilde{\varphi}_t$ is easier. In practice we found
that our procedure was fairly robust for reasonable choices of d. Yet
this is an interesting theoretical question. We also believe it is a
difficult one. Using the marginal likelihood alone will not work as the
optimal choice of d also depends on the computational budget for
optimising $\tilde{\varphi}_t$. We hope to study this question in future
work. For now, we give some recommendations at the end. Our modified
algorithm with these practical considerations is given below. Observe
that in this specification if we use d = D we have the original GP-UCB
algorithm.

Algorithm 2 Practical-Add-GP-UCB

Input: $N_{init}$, $N_{cyc}$, d, M

• $\mathcal{D}_0 \leftarrow$ $N_{init}$ points chosen uniformly at random.

• for t = 1, 2, ...

1. if (t mod $N_{cyc}$ = 0), Learn the kernel hyperparameters and the
   decomposition $\{\mathcal{X}^{(j)}\}$ by maximising the GP marginal
   likelihood.

2. Perform steps 1-3 in Algorithm 1 with $\beta_t = 0.2\, d \log(2t)$.

3. $\mathcal{D}_t = \mathcal{D}_{t-1} \cup \{(x_t, y_t)\}$.

4. Perform Bayesian posterior updates conditioned on $\mathcal{D}_t$ to
   obtain $\mu_t^{(j)}$, $\sigma_t^{(j)}$ for j = 1, ..., M.



Figure 2. Illustration of the trimodal function $f_{d'}$ in d' = 2.


5. Experiments 

To demonstrate the efficacy of Add-GP-UCB over GP-UCB we optimise the
acquisition function under a constrained budget. Following Brochu et al.
(2010) we use DiRect to maximise $\varphi_t$ and $\tilde{\varphi}_t$. We
compare Add-GP-UCB against GP-UCB, random querying (RAND) and DiRect.³
On the real datasets we also compare it to the Expected Improvement
(GP-EI) acquisition function which is popular in BO applications and the
method of Wang et al. (2013) which uses a random projection before
applying BO (REMBO). We have multiple instantiations of Add-GP-UCB for
different values for (d, M). For optimisation, we perform comparisons
based on the simple regret $S_T$ and for bandits we use the time
averaged cumulative regret $R_T/T$.
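Both evaluation metrics follow directly from the definitions in Section 3 once the optimum value is known, as it is for synthetic functions. A minimal sketch, with a made-up trace of noiseless function values at the queried points:

```python
def cumulative_regret(f_opt, f_values):
    """R_T: sum of the gaps f(x_*) - f(x_t) over all queries."""
    return sum(f_opt - v for v in f_values)

def simple_regret(f_opt, f_values):
    """S_T: gap between the optimum and the best value queried so far."""
    return f_opt - max(f_values)

# Hypothetical trace: optimum value 1.0 and four queried function values.
f_opt = 1.0
f_values = [0.2, 0.5, 0.4, 0.9]
R_T = cumulative_regret(f_opt, f_values)
S_T = simple_regret(f_opt, f_values)
avg_R_T = R_T / len(f_values)
```

The simple regret never exceeds the time averaged cumulative regret, which is why a no-regret bandit procedure is also consistent for optimisation.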

For all GPB/BO methods we set $N_{init} = 10$, $N_{cyc} = 25$ in all
experiments. Further, for the first 25 iterations we set the bandwidth
to a small value ($10^{-5}$) to encourage an explorative strategy. We
use SE kernels for each additive kernel and use the same scale $\sigma$
and bandwidth h hyperparameters for all the kernels. Every 25 iterations
we maximise the marginal likelihood with respect to these 2
hyperparameters in addition to the decomposition.

3 There are several optimisation methods based on simulated annealing,
cross entropy and genetic algorithms. We use DiRect since it is easy to
configure and known to work well in practice.

In contrast to existing literature in the BO community, we found that
the UCB acquisitions outperformed GP-EI. One possible reason may be that
under a constrained budget, UCB is robust to imperfect maximisation
(Theorem 5) whereas GP-EI may not be. Another reason may be our choice
of constants in UCB (Section 4.4).


5.1. Simulations on Synthetic Data 

First we demonstrate our technique on a series of synthetic examples. For this we construct additive functions for different values of the maximum group size $d'$ and the number of groups $M'$. We use the prime to distinguish them from Add-GP-UCB instantiations with different combinations of $(d, M)$ values. The $d'$-dimensional function is,


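$$f_{d'}(x) = \log\left( 0.1\,\frac{1}{h_{d'}^{d'}} \exp\left(-\frac{\|x - v_1\|^2}{2h_{d'}^2}\right) + 0.1\,\frac{1}{h_{d'}^{d'}} \exp\left(-\frac{\|x - v_2\|^2}{2h_{d'}^2}\right) + 0.8\,\frac{1}{h_{d'}^{d'}} \exp\left(-\frac{\|x - v_3\|^2}{2h_{d'}^2}\right) \right) \qquad (7)$$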


where $v_1, v_2, v_3$ are fixed $d'$-dimensional vectors and $h_{d'} = 0.01\, d'^{\,0.1}$. Then we create $M'$ groups of coordinates by randomly adding $d'$ coordinates into each group. On each such group we use $f_{d'}$ and then add them up to obtain the composite function $f$. Precisely,

$$f(x) = f_{d'}(x^{(1)}) + \dots + f_{d'}(x^{(M')}).$$

The remaining $D - d'M'$ coordinates do not contribute to the function. Since $f_{d'}$ has 3 modes, $f$ will have $3^{M'}$ modes. We have illustrated $f_{d'}$ for $d' = 2$ in Figure 2.
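
The construction of the synthetic objective can be sketched in Python as follows. This is only an illustration under stated assumptions: the centres $v_1, v_2, v_3$ are drawn uniformly from $[0,1]^{d'}$ (the text only says they are fixed), the negative sign in the exponent and the weights 0.1, 0.1, 0.8 follow our reading of Equation (7), and the names `make_trimodal` and `make_additive` are hypothetical.

```python
import numpy as np

def make_trimodal(d_prime, rng):
    """Build f_{d'}: log of a 3-component Gaussian-shaped mixture (our reading of Eq. 7)."""
    h = 0.01 * d_prime ** 0.1                      # bandwidth h_{d'}
    centres = rng.uniform(0.0, 1.0, size=(3, d_prime))  # assumed: random fixed v_1, v_2, v_3
    weights = np.array([0.1, 0.1, 0.8])

    def f(x):
        sq = np.sum((centres - x) ** 2, axis=1)
        # log of each mixture term; the factor 1/h^{d'} becomes -d' log h
        terms = np.log(weights) - d_prime * np.log(h) - sq / (2.0 * h ** 2)
        m = terms.max()                            # log-sum-exp for numerical stability
        return m + np.log(np.sum(np.exp(terms - m)))
    return f

def make_additive(D, d_prime, M_prime, seed=0):
    """Return the composite f and the M' disjoint coordinate groups of size d'."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(D)
    groups = [perm[j * d_prime:(j + 1) * d_prime] for j in range(M_prime)]
    f_dp = make_trimodal(d_prime, rng)

    def f(x):
        x = np.asarray(x, dtype=float)
        # sum of f_{d'} over the groups; coordinates outside all groups are ignored
        return sum(f_dp(x[g]) for g in groups)
    return f, groups
```

For the $(D, d', M') = (10, 3, 3)$ configuration, one of the ten coordinates is left out of every group, so perturbing it leaves the function value unchanged.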

In the synthetic experiments we use an instantiation of Add-GP-UCB that knows the decomposition, i.e. $(d, M) = (d', M')$ and the grouping of coordinates. We refer to this as Add-*. For the rest we use a $(d, M)$ decomposition by creating $M$ groups of size at most $d$ and find a good grouping by partially maximising the marginal likelihood (Section 4.4). We refer to them as Add-d/M.

For GP-UCB we allocate a budget of $\min(5000, 100D)$ DiRect function evaluations to optimise the acquisition function. For all Add-d/M methods we set it to 90% of this amount$^4$ to account for the additional overhead in posterior inference for each $f^{(j)}$. Therefore, in our 10-dimensional problem we maximise $\varphi_t$ with $\beta_t = 2\log(2t)$ using 1000 DiRect evaluations, whereas for Add-2/5 we maximise each $\tilde{\varphi}_t^{(j)}$ with $\beta_t = 0.4\log(2t)$ using 180 evaluations.
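
The budget arithmetic quoted above can be reproduced with a short Python sketch. The even split of the 90% budget across the $M$ groups is our assumption; it is consistent with the 180 evaluations quoted for Add-2/5 but is not stated explicitly. The function names are hypothetical.

```python
import math

def acquisition_budget(D, M=1, additive=False, cap=5000, per_dim=100, frac=0.9):
    """DiRect evaluations per acquisition maximisation.

    GP-UCB: min(cap, per_dim * D).
    Add-d/M: frac of that amount, split evenly over the M groups (assumed).
    """
    total = min(cap, per_dim * D)
    if not additive:
        return total
    return int(round(frac * total / M))

def beta_t(d, t):
    """The constant used in the experiments: beta_t = 0.2 * d * log(2t)."""
    return 0.2 * d * math.log(2 * t)
```

With $D = 10$ this gives 1000 evaluations for GP-UCB and 180 per group for Add-2/5, and `beta_t(10, t)` and `beta_t(2, t)` reduce to $2\log(2t)$ and $0.4\log(2t)$ respectively.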

The results are given in Figures 3 and 4. We refer to each example by the configuration of the additive function, i.e. its $(D, d', M')$ values. In the (10, 3, 3) example Add-* does

4 While the 90% seems arbitrary, in our experiments this was hardly a factor as the cost was dominated by the inversion of $\Delta$.

Figure 3. Results on the synthetic datasets. In all images the x-axis is the number of queries and the y-axis is the regret in log scale. We have indexed each experiment by its $(D, d', M')$ values. The first row is $S_T$ for the experiments with $(D, d', M')$ set to (10, 3, 3), (24, 6, 4), (24, 11, 2) and the second row is $R_T/T$ for the same experiments. The third row is $S_T$ for (40, 5, 8), (40, 18, 2), (40, 35, 1) and the fourth row is the corresponding $R_T/T$. In some figures, the error bars are not visible since they are small and hidden by the bullets. All figures were produced by averaging over 20 runs.

Figure 4. More results on synthetic experiments. The simple regret $S_T$ (first row) and cumulative regret $R_T/T$ (second row) for functions with $(D, d', M')$ set to (96, 5, 19), (96, 29, 3), (120, 55, 2) respectively. Read the caption under Figure 3 for more details.


best since it knows the correct model and the acquisition function can be maximised within the budget. However, the Add-3/4 and Add-5/2 models do well too and outperform GP-UCB. Add-1/10 performs poorly since it is statistically not expressive enough to capture the true function. In the (24, 11, 2), (40, 18, 2), (40, 35, 1), (96, 29, 3) and (120, 55, 2) examples Add-* outperforms GP-UCB. However, it is not competitive with the Add-d/M instantiations with small $d$. Even though Add-* knows the correct decomposition, there are two possible failure modes since $d'$ is large. The kernel is complex and the estimation error is very high in the absence of sufficient data points. In addition, optimising the acquisition is also difficult. This illustrates our previous argument that using an additive kernel can be advantageous even if the function is not additive or the decomposition is not known. In the (24, 6, 4), (40, 5, 8) and (96, 5, 19) examples Add-* performs best as $d'$ is small enough. But again, almost all Add-d/M instantiations outperform GP-UCB. In contrast to the small $D$ examples, for large $D$, GP-UCB and Add-d/M with large $d$ perform worse than DiRect. This is probably because our budget for maximising $\varphi_t$ is inadequate to optimise the acquisition function to sufficient accuracy. For some of the large $D$ examples the cumulative regret is low for Add-GP-UCB and Add-d/M with large $d$. This is probably because they have already started exploiting whereas the Add-d/M methods with small $d$ are still exploring. We posit that if we run for more iterations we will be able to see the improvements.

5.2. SDSS Astrophysical Dataset 

Here we used galaxy data from the Sloan Digital Sky Survey (SDSS). The task is to find the maximum likelihood estimators for a simulation based astrophysical likelihood model. Data and software for computing the likelihood are taken from Tegmark et al. (2006). The software itself takes in only 9 parameters but we augment this to 20 dimensions to emulate the fact that in practical astrophysical problems we may not know the true parameters on which the problem is dependent. This also allows us to effectively demonstrate the superiority of our methods over alternatives. Each query to this likelihood function takes about 2-5 seconds. In order to be wall clock time competitive with RAND and DiRect we use only 500 evaluations for GP-UCB, GP-EI and REMBO and 450 for Add-d/M to maximise the acquisition function.

We have shown the maximum value obtained over 400 iterations of each algorithm in Figure 5(a). Note that RAND outperforms DiRect here since a random query strategy is effectively searching in 9 dimensions. Despite this advantage to RAND all BO methods do better. Moreover, despite the fact that the function may not be additive, all Add-d/M methods outperform GP-UCB. Since the function only depends on 9 parameters we use REMBO with a 9 dimensional projection. Yet, it is not competitive with the Add-d/M methods. Possible reasons for this may include the scaling of the parameter space by $\sqrt{d}$ in REMBO and the imperfect optimisation of the acquisition function. Here Add-5/4 performs slightly better than the rest since it seems to have the best tradeoff between being statistically expressive enough to capture the function and being easy enough to optimise the acquisition function within the allocated budget.





5.3. Viola & Jones Face Detection 

The Viola & Jones (VJ) Cascade Classifier (Viola & Jones, 2001) is a popular method for face detection in computer vision based on the Adaboost algorithm. The $K$-cascade has $K$ weak classifiers, each of which outputs a score for any given image. When we wish to classify an image we pass that image through each classifier. If at any point the score falls below a certain threshold the image is classified as negative. If the image passes through all classifiers then it is classified as positive. The threshold values at each stage are usually pre-set based on prior knowledge. There is no reason to believe that these threshold values are optimal. In this experiment we wish to find an optimal set of values for these thresholds by optimising the classification accuracy over a training set.
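
The decision rule of the cascade, and the objective being optimised, can be sketched in Python as follows. This is a simplified illustration and not the OpenCV implementation: it assumes the per-stage scores of every image are already available as plain lists, and the names `cascade_classify` and `accuracy` are hypothetical.

```python
def cascade_classify(stage_scores, thresholds):
    """Return True (positive) only if every stage score clears its threshold.

    stage_scores: scores of one image at stages 1..K (assumed precomputed).
    thresholds:   the K per-stage thresholds, i.e. the parameters being tuned.
    """
    for score, thr in zip(stage_scores, thresholds):
        if score < thr:
            return False      # rejected at this stage: classified as negative
    return True               # passed all stages: classified as positive

def accuracy(all_scores, labels, thresholds):
    """Classification accuracy over a set of images; this is the black-box objective."""
    preds = [cascade_classify(s, thresholds) for s in all_scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)
```

In this view the thresholds form the input vector of the function being optimised, and each evaluation of `accuracy` corresponds to one expensive query.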

For this task, we use 1000 images from the Viola & Jones face dataset containing both face and non-face images. We use the implementation of the VJ classifier that comes with OpenCV (Bradski & Kaehler, 2008), which uses a 22-stage cascade, and modify it to take in the threshold values as a parameter. As our domain $\mathcal{X}$ we choose a neighbourhood around the configuration given in OpenCV. Each function call takes about 30-40 seconds and is the dominant cost in this experiment. We use 1000 DiRect evaluations to optimise the acquisition function for GP-UCB, GP-EI and REMBO and 900 for the Add-d/M instantiations. Since we do not know the structure of the function we use REMBO with a 5 dimensional projection. The results are given in Figure 5(b). Not surprisingly, REMBO performs worst as it is only searching on a 5 dimensional space. Barring Add-1/22, all other instantiations perform better than GP-UCB and GP-EI, with Add-6/4 performing the best. Interestingly, we also find a value for the thresholds that outperforms the configuration used in OpenCV.

6. Conclusion 

Recommendations: Based on our experiences, we recommend the following. If $f$ is known to be additive, the decomposition is known and $d$ is small enough so that $\tilde{\varphi}_t$ can be efficiently optimised, then running Add-GP-UCB



Figure 5. Results on the Astrophysical experiment (a) and the Viola and Jones dataset (b). The x-axis is the number of queries and the y-axis is the maximum value.

with the known decomposition is likely to produce the best results. If not, then use a small value for $d$ and run Add-GP-UCB while partially optimising for the decomposition periodically (Section 4.4). In our experiments we found that values of $d$ between 3 and 12 were reasonable choices. However, note that this depends on the computational budget for optimising the acquisition, the query budget for $f$ and to a certain extent the function $f$ itself.

Summary: Our algorithm takes into account several practical considerations in real world GPB/BO applications such as computational constraints in optimising the acquisition and the fact that we have to work with relatively few data points since function evaluations are expensive. Our framework effectively addresses these concerns without considerably compromising on the statistical integrity of the model. We believe that this provides a promising direction to scale GPB/BO methods to high dimensions.

Future Work: Our experiments indicate that our methods perform well beyond the scope suggested by our theory. Developing an analysis that takes into account the bias-variance and computational tradeoffs in approximating and optimising a non-additive function via an additive model is an interesting challenge. We also intend to extend this framework to discrete settings and other acquisition functions, and to handle more general decompositions.

A. Some Auxiliary Material

A.1. Review of the GP-UCB Algorithm

In this subsection we present a brief summary of the GP-UCB algorithm of Srinivas et al. (2010). The algorithm is given in Algorithm 3.

The following theorem gives the rate of convergence for GP-UCB. Note that under an additive kernel, this is the same rate as Theorem 5, which uses a different acquisition function. Note the differences in the choice of $\beta_t$.


Theorem 6. (Modification of Theorem 2 in (Srinivas et al., 2010)) Suppose $f$ is constructed by sampling $f^{(j)} \sim \mathcal{GP}(\mathbf{0}, \kappa^{(j)})$ for $j = 1, \dots, M$ and then adding them. Let all kernels $\kappa^{(j)}$ satisfy Assumption 2 for some $L, a, b$. Further, we maximise the acquisition function $\varphi_t$ to within $\zeta_0 t^{-1/2}$ accuracy at time step $t$. Pick $\delta \in (0, 1)$ and choose


$$\beta_t = 2\log\left(\frac{2t^2\pi^2}{\delta}\right) + 2D\log(Dt^3) \in O(D\log t).$$


Then, GP-UCB attains cumulative regret $R_T \in O\left(\sqrt{D\gamma_T T \log T}\right)$ and hence simple regret $S_T \in O\left(\sqrt{D\gamma_T \log T / T}\right)$. Precisely, with probability $> 1 - \delta$,

$$\forall T \ge 1, \quad R_T \le \sqrt{8 C_1 \beta_T M T \gamma_T} + 2\zeta_0\sqrt{T} + C_2,$$

where $C_1 = 1/\log(1 + \eta^{-2})$ and $C_2$ is a constant depending on $a$, $b$, $D$, $\delta$, $L$ and $\eta$.


Proof. Srinivas et al. (2010) bound the regret for exact maximisation of the GP-UCB acquisition $\varphi_t$. By following an analysis similar to our proof of Theorem 5, the regret can be shown to be the same for a $\zeta_0 t^{-1/2}$-optimal maximisation. □


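Algorithm 3 GP-UCB
Input: Kernel $\kappa$, Input Space $\mathcal{X}$.

• $\mathcal{D}_0 \leftarrow \emptyset$,

• $(\mu_0, \kappa_0) \leftarrow (\mathbf{0}, \kappa)$

• for $t = 1, 2, \dots$

1. $x_t \leftarrow \operatorname{argmax}_{z \in \mathcal{X}} \; \mu_{t-1}(z) + \sqrt{\beta_t}\, \sigma_{t-1}(z)$

2. $y_t \leftarrow$ Query $f$ at $x_t$.

3. $\mathcal{D}_t = \mathcal{D}_{t-1} \cup \{(x_t, y_t)\}$.

4. Perform Bayesian posterior updates to obtain $\mu_t, \sigma_t$.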
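
A minimal Python sketch of the GP-UCB loop is given below for a one-dimensional problem. It is an illustration under several assumptions of ours: the acquisition is maximised by exhaustive search over a fixed grid instead of DiRect, an SE kernel with unit scale and a fixed bandwidth is used, and $\beta_t = 0.2\log(2t)$ is the experimental constant with $d = 1$ and not the theoretical choice. All function names are hypothetical.

```python
import numpy as np

def se_kernel(A, B, h=0.2):
    """Squared exponential kernel between two 1-D arrays of points."""
    d2 = (A[:, None] - B[None, :]) ** 2
    return np.exp(-d2 / (2 * h ** 2))

def gp_posterior(X, y, Z, eta=0.1):
    """Posterior mean and standard deviation at Z given noisy observations (X, y)."""
    if len(X) == 0:
        return np.zeros(len(Z)), np.ones(len(Z))   # zero-mean prior, k(x, x) = 1
    K = se_kernel(X, X) + eta ** 2 * np.eye(len(X))
    Ks = se_kernel(Z, X)
    mu = Ks @ np.linalg.solve(K, y)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
    return mu, np.sqrt(np.maximum(var, 0.0))

def gp_ucb(f, T=30, eta=0.1, seed=0):
    """Run T rounds of GP-UCB on [0, 1]; returns the queried points and observations."""
    rng = np.random.default_rng(seed)
    Z = np.linspace(0.0, 1.0, 201)                 # candidate set (stands in for DiRect)
    X, y = np.empty(0), np.empty(0)
    for t in range(1, T + 1):
        mu, sd = gp_posterior(X, y, Z, eta)        # mu_{t-1}, sigma_{t-1}
        beta = 0.2 * np.log(2 * t)                 # assumed constant with d = 1
        x_t = Z[np.argmax(mu + np.sqrt(beta) * sd)]  # step 1: maximise the UCB
        y_t = f(x_t) + eta * rng.standard_normal()   # step 2: noisy query
        X, y = np.append(X, x_t), np.append(y, y_t)  # step 3: augment the data
    return X, y
```

Step 4 of the algorithm is implicit here: the posterior is recomputed from all data at the start of each round, which is simple but costs a fresh linear solve per iteration.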


A.2. Sequential Optimisation Approaches 

If the function is known to be additive, we could consider several other approaches for maximisation. We list two of them 

here and explain their deficiencies. We recommend that the reader read the main text before reading this section. 

A.2.1. Optimise one group and proceed to the next 

First, fix the coordinates of $x^{(j)}$, $j \neq 1$, and optimise w.r.t. $x^{(1)}$ by querying the function for a pre-specified number of times. Then we proceed sequentially, optimising with respect to $x^{(2)}, x^{(3)}, \dots$. We have outlined this algorithm in Algorithm 4.

There are several reasons this approach is not desirable. 

• First, it places too much faith on the additive assumption and requires that we know the decomposition at the start of the algorithm. Note that this strategy will only have searched the space in $M$ $d$-dimensional subspaces. In our approach even if the function is not additive we can still hope to do well since we learn the best additive approximation to the true function. Further, if the decomposition is not known we could learn the decomposition “on the go” or at least find a reasonably good decomposition as we have explained in Section 4.4.

• Such a sequential approach is not an anytime algorithm. This in particular means that we need to predetermine the number of queries to be allocated to each group. After we proceed to a new group it is not straightforward to come back and improve on the solution obtained for an older group.

• This approach is not suitable for the bandits setting. We suffer large instantaneous regret up until we get to the last group. Further, since we cannot come back after we proceed beyond a group, we cannot improve on the best regret obtained in that group.

Our approach does not have any of these deficiencies. 


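Algorithm 4 Seq-Add-GP-UCB

Input: Kernels $\kappa^{(1)}, \dots, \kappa^{(M)}$, Decomposition $(\mathcal{X}^{(j)})_{j=1}^M$, Query Budget $T$.

• $\mathbb{R}^D \ni \theta = \bigcup_{j=1}^M \theta^{(j)} = \text{rand}([0, 1]^D)$

• for $j = 1, \dots, M$

1. $\mathcal{D}_0^{(j)} \leftarrow \emptyset$,

2. $(\mu_0^{(j)}, \kappa_0^{(j)}) \leftarrow (\mathbf{0}, \kappa^{(j)})$.

3. for $t = 1, 2, \dots, T/M$

(a) $x_t^{(j)} \leftarrow \operatorname{argmax}_{z \in \mathcal{X}^{(j)}} \; \mu_{t-1}^{(j)}(z) + \sqrt{\beta_t}\, \sigma_{t-1}^{(j)}(z)$

(b) $x_t \leftarrow x_t^{(j)} \cup \bigcup_{k \neq j} \theta^{(k)}$

(c) $y_t \leftarrow$ Query $f$ at $x_t$.

(d) $\mathcal{D}_t^{(j)} = \mathcal{D}_{t-1}^{(j)} \cup \{(x_t^{(j)}, y_t)\}$.

(e) Perform Bayesian posterior updates to obtain $\mu_t^{(j)}, \sigma_t^{(j)}$.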

4. $\theta^{(j)} \leftarrow x_{T/M}^{(j)}$

• Return $\theta$


A.2.2. Only change one Group per Query 

In this strategy, the approach would be very similar to Add-GP-UCB except that at each query we will only update one group at a time. If it is the $k$-th group, the query point is determined by maximising $\tilde{\varphi}_t^{(k)}$ for $x_t^{(k)}$, and for all other groups we use values from the previous rotation. After $M$ iterations we cycle through the groups. We have outlined this in Algorithm 5.

This is a reasonable approach and does not suffer from the same deficiencies as Algorithm 4. Maximising the acquisition function will also be slightly easier, $O(\zeta^{-d})$, since we need to optimise only one group at a time. However, the regret for this approach would be $O\left(M\sqrt{D\gamma_T T \log T}\right)$, which is a factor of $M$ worse than the regret in our method (this can be shown by following an analysis similar to the one in Section B.2). This is not surprising, since at each iteration you are moving in $d$ coordinates of the space and you have to wait $M$ iterations before the entire point is updated.


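Algorithm 5 Add-GP-UCB-Buggy

Input: Kernels $\kappa^{(1)}, \dots, \kappa^{(M)}$, Decomposition $(\mathcal{X}^{(j)})_{j=1}^M$

• $\mathcal{D}_0 \leftarrow \emptyset$,

• for $j = 1, \dots, M$, $(\mu_0^{(j)}, \kappa_0^{(j)}) \leftarrow (\mathbf{0}, \kappa^{(j)})$.

• for $t = 1, 2, \dots$

1. $k = t \bmod M$

2. $x_t^{(k)} \leftarrow \operatorname{argmax}_{z \in \mathcal{X}^{(k)}} \; \mu_{t-1}^{(k)}(z) + \sqrt{\beta_t}\, \sigma_{t-1}^{(k)}(z)$

3. for $j \neq k$, $x_t^{(j)} \leftarrow x_{t-1}^{(j)}$.

4. $x_t \leftarrow \bigcup_{j=1}^M x_t^{(j)}$.

5. $y_t \leftarrow$ Query $f$ at $x_t$.

6. $\mathcal{D}_t = \mathcal{D}_{t-1} \cup \{(x_t, y_t)\}$.

7. Perform Bayesian posterior updates to obtain $\mu_t^{(j)}, \sigma_t^{(j)}$ for $j = 1, \dots, M$.

B. Proofs of Results in Section 4.3

B.1. Bounding the Information Gain $\gamma_T$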

For this we will use the following two results from Srinivas et al. (2010). 

Lemma 7. (Information Gain in GP, (Srinivas et al., 2010) Lemma 5.3) Using the basic properties of a GP, they show that

$$I(y_A; f_A) = \frac{1}{2} \sum_{t=1}^{n} \log\left(1 + \eta^{-2} \sigma_{t-1}^2(x_t)\right),$$

where $\sigma_{t-1}^2$ is the posterior variance after observing the first $t - 1$ points.
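
The sum in Lemma 7 can be checked numerically against the closed form $\frac{1}{2}\log\det(I + \eta^{-2} K)$, which is the standard expression for the mutual information between a GP and its noisy observations. The Python sketch below does this for an arbitrary kernel matrix; the function names are hypothetical and the closed form is a fact we bring in, not something stated in the text above.

```python
import numpy as np

def info_gain_direct(K, eta):
    """0.5 * log det(I + K / eta^2) for the kernel matrix K of the queried points."""
    n = K.shape[0]
    _, logdet = np.linalg.slogdet(np.eye(n) + K / eta ** 2)
    return 0.5 * logdet

def info_gain_sequential(K, eta):
    """0.5 * sum_t log(1 + eta^-2 * sigma_{t-1}^2(x_t)), the form in Lemma 7."""
    n = K.shape[0]
    total = 0.0
    for t in range(n):
        if t == 0:
            var = K[0, 0]                          # prior variance at the first point
        else:
            Kp = K[:t, :t] + eta ** 2 * np.eye(t)  # noisy kernel matrix of the first t points
            k = K[:t, t]
            var = K[t, t] - k @ np.linalg.solve(Kp, k)  # posterior variance at point t+1
        total += 0.5 * np.log1p(var / eta ** 2)
    return total
```

The two functions should agree up to floating point error for any positive semi-definite `K`, because the determinant factorises into the sequence of conditional variances.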

Theorem 8. (Bound on Information Gain, (Srinivas et al., 2010) Theorem 8) Suppose that $\mathcal{X}$ is compact and $\kappa$ is a kernel on $d$ dimensions satisfying Assumption 2. Let $n_T = C_9 T^\tau \log T$ where $C_9 = 4d + 2$. For any $T_* \in \{1, \dots, \min(T, n_T)\}$, let $B_\kappa(T_*) = \sum_{s > T_*} \lambda_s$. Here $(\lambda_n)_{n \in \mathbb{N}}$ are the eigenvalues of $\kappa$ w.r.t. the uniform distribution over $\mathcal{X}$. Then,

$$\gamma_T \le \inf_\tau \left( \frac{1/2}{1 - e^{-1}} \max_{r \in \{1, \dots, T\}} \left( T_* \log(r n_T/\eta^2) + C_9 \eta^2 (1 - r/T)\left(T^{\tau+1} B_\kappa(T_*) + 1\right) \log T \right) + O(T^{1 - \tau/d}) \right).$$


B.1.1. Proof of Theorem 4-1 


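Proof. We will use some bounds on the eigenvalues for the simple squared exponential kernel given in (Seeger et al., 2008). It was shown that the eigenvalues $\{\lambda_s^{(j)}\}$ of $\kappa^{(j)}$ satisfy $\lambda_s^{(j)} \le c^d B^{s^{1/d_j}}$ where $B < 1$ (see Remark 9). Since the kernel is additive and $x^{(i)} \cap x^{(j)} = \emptyset$, the eigenfunctions corresponding to $\kappa^{(i)}$ and $\kappa^{(j)}$ will be orthogonal. Hence the eigenvalues of $\kappa$ will just be the union of the eigenvalues of the individual kernels, i.e. $\{\lambda_s\} = \bigcup_{j=1}^M \{\lambda_s^{(j)}\}$. As $B < 1$, $\lambda_s^{(j)} \le c^d B^{s^{1/d}}$. Let $T_+ = \lfloor T_*/M \rfloor$ and $\alpha = -\log B$. Then,

$$\begin{aligned}
B_\kappa(T_*) = \sum_{s > T_*} \lambda_s &\le M c^d \sum_{s > T_+} B^{s^{1/d}} \\
&\le c^d M \left( B^{T_+^{1/d}} + \int_{T_+}^{\infty} \exp(-\alpha x^{1/d})\, dx \right) \\
&\le c^d M \left( B^{T_+^{1/d}} + d\alpha^{-d}\, \Gamma(d, \alpha T_+^{1/d}) \right) \\
&\le c^d M e^{-\alpha T_+^{1/d}} \left( 1 + d!\, d\alpha^{-d} (\alpha T_+^{1/d})^{d-1} \right).
\end{aligned}$$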


The last step holds true whenever $\alpha T_+^{1/d} \ge 1$. Here in the second step we bound the series by an integral and in the third step we used the substitution $y = \alpha x^{1/d}$ to simplify the integral. Here $\Gamma(s, x) = \int_x^\infty t^{s-1} e^{-t}\, dt$ is the (upper) incomplete Gamma function. In the last step we have used the following identity and the bound for integral $s$ and $x \ge 1$,

$$\Gamma(s, x) = (s - 1)!\, e^{-x} \sum_{k=0}^{s-1} \frac{x^k}{k!} \le s!\, e^{-x} x^{s-1}.$$
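
As a quick numerical sanity check of the incomplete Gamma identity and bound used in this step, the following Python sketch evaluates the finite-sum form of $\Gamma(s, x)$ for integer $s$ and compares it against $s!\, e^{-x} x^{s-1}$. The function names are ours and purely illustrative.

```python
import math

def upper_incomplete_gamma_int(s, x):
    """Gamma(s, x) for integer s >= 1 via (s-1)! e^{-x} sum_{k<s} x^k / k!."""
    series = sum(x ** k / math.factorial(k) for k in range(s))
    return math.factorial(s - 1) * math.exp(-x) * series

def gamma_upper_bound(s, x):
    """The bound s! e^{-x} x^{s-1}, valid for integer s >= 1 and x >= 1."""
    return math.factorial(s) * math.exp(-x) * x ** (s - 1)
```

The bound follows because each of the $s$ terms $x^k/k!$ is at most $x^{s-1}$ when $x \ge 1$; it is tight for $s = 1$.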


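By using $\tau = d$ and by using $T_* \le (M + 1)T_+$, we use Theorem 8 to obtain the following bound on $\gamma_T$,

$$\gamma_T \le \frac{1/2}{1 - e^{-1}} \max_{r \in \{1, \dots, T\}} \left( (M + 1)T_+ \log(r n_T/\eta^2) + C_9 \eta^2 (1 - r/T) \log T \left( 1 + c^d M e^{-\alpha T_+^{1/d}} T^{d+1} \left( 1 + d!\, d\alpha^{-d} (\alpha T_+^{1/d})^{d-1} \right) \right) \right). \qquad (8)$$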


Now we need to pick $T_+$ so as to balance these two terms. We will choose $T_+ = \left(\frac{\log(T n_T)}{\alpha}\right)^d$, which is less than $\min(T, n_T)/M$ for sufficiently large $T$. Then $e^{-\alpha T_+^{1/d}} = 1/(T n_T)$. Then the first term $S_1$ inside the parenthesis is,

$$\begin{aligned}
S_1 &= (M + 1) \left(\frac{\log(T n_T)}{\alpha}\right)^d \log\left(\frac{r n_T}{\eta^2}\right) \\
&\in O\left( M \left(\log(T^{d+1} \log T)\right)^d \log(r T^d \log T) \right) \\
&\in O\left( M d^{d+1} (\log T)^{d+1} + M d^d (\log T)^d \log(r) \right).
\end{aligned}$$

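Note that the constant in front has exponential dependence on $d$ but we ignore it since we already have $d^d$, $(\log T)^d$ terms. The second term $S_2$ becomes,

$$\begin{aligned}
S_2 &= C_9 \eta^2 (1 - r/T) \log T \left( 1 + \frac{c^d M}{T n_T} T^{d+1} \left( 1 + d!\, d\alpha^{-d} (\log(T n_T))^{d-1} \right) \right) \\
&\le C_9 \eta^2 (1 - r/T) \left( \log T + \frac{c^d M}{C_9} \left( 1 + d!\, d\alpha^{-d} (\log(T n_T))^{d-1} \right) \right) \\
&\le C_9 \eta^2 (1 - r/T) \left( O(\log T) + O(1) + O(d!\, d^d (\log T)^{d-1}) \right) \\
&\in O\left( (1 - r/T)\, d!\, d^d (\log T)^{d-1} \right).
\end{aligned}$$

Since $S_1$ dominates $S_2$, we should choose $r = T$ to maximise the RHS in (8). This gives us,

$$\gamma_T \in O\left( M d^{d+1} (\log T)^{d+1} \right) \subseteq O\left( D d^d (\log T)^{d+1} \right).$$

□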


B.1.2. Proof of Theorem 4-2 

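Proof. Once again, we use bounds given in (Seeger et al., 2008). It was shown that the eigenvalues $\{\lambda_s^{(j)}\}$ for $\kappa^{(j)}$ satisfy $\lambda_s^{(j)} \le c^d s^{-\frac{2\nu + d_j}{d_j}}$ (see Remark 9). By following a similar argument to above we have $\{\lambda_s\} = \bigcup_{j=1}^M \{\lambda_s^{(j)}\}$ and $\lambda_s^{(j)} \le c^d s^{-\frac{2\nu + d}{d}}$. Let $T_+ = \lfloor T_*/M \rfloor$. Then,

$$B_\kappa(T_*) = \sum_{s > T_*} \lambda_s \le M c^d \sum_{s > T_+} s^{-\frac{2\nu + d}{d}} \le M c^d \left( T_+^{-\frac{2\nu + d}{d}} + \int_{T_+}^{\infty} s^{-\frac{2\nu + d}{d}}\, ds \right) \le C_8 2^d M T_+^{1 - \frac{2\nu + d}{d}},$$

where $C_8$ is an appropriate constant. We set $T_+ = (T n_T)^{\frac{d}{2\nu + d}} (\log(T n_T))^{-\frac{d}{2\nu + d}}$ and accordingly we have the following bound on $\gamma_T$ as a function of $T_+ \in \{1, \dots, \min(T, n_T)/M\}$,

$$\gamma_T \le \inf_\tau \left( \frac{1/2}{1 - e^{-1}} \max_{r \in \{1, \dots, T\}} \left( (M + 1)T_+ \log(r n_T/\eta^2) + C_9 \eta^2 (1 - r/T)\left( \log T + C_8 2^d M T_+ \log(T n_T) \right) \right) + O(T^{1 - \tau/d}) \right). \qquad (9)$$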


Since this is a concave function of $r$ we can find the optimum by setting the derivative w.r.t. $r$ to be zero. We get $r \in O(T / (2^d \log(T n_T)))$ and hence,

$$\begin{aligned}
\gamma_T &\in \inf_\tau \left( O\left( M T_+ \log\left( \frac{T n_T}{2^d \log(T n_T)} \right) \right) + O\left( M 2^d T_+ \log(T n_T) \right) + O(T^{1 - \tau/d}) \right) \\
&\in \inf_\tau \left( O\left( M 2^d \log(T n_T) \left( \frac{T^{\tau+1} \log T}{(\tau + 1)\log T + \log\log T} \right)^{\frac{d}{2\nu + d}} \right) + O(T^{1 - \tau/d}) \right) \\
&\in \inf_\tau \left( O\left( M 2^d \log(T n_T)\, T^{\frac{(\tau+1)d}{2\nu + d}} \right) + O(T^{1 - \tau/d}) \right) \\
&\in O\left( M 2^d\, T^{\frac{d(d+1)}{2\nu + d(d+1)}} \log(T) \right).
\end{aligned}$$

Here in the second step we have substituted the values for $T_+$ first and then $n_T$. In the last step we have balanced the polynomial dependence on $T$ in both terms by setting $\tau = \frac{2\nu d}{2\nu + d(d+1)}$.

□

Remark 9. The eigenvalues and eigenfunctions for the kernel are defined with respect to a base distribution on $\mathcal{X}$. In the development of Theorem 8, Srinivas et al. (2010) draw $n_T$ samples from the uniform distribution on $\mathcal{X}$. Hence, the eigenvalues/eigenfunctions should be w.r.t. the uniform distribution. The bounds given in Seeger et al. (2008) are for the uniform distribution for the Matern kernel and a Gaussian distribution for the squared exponential kernel. For the latter case, Srinivas et al. (2010) argue that the uniform distribution still satisfies the required tail constraints and therefore the bounds would only differ up to constants.


B.2. Rates on Add-GP-UCB 

Our analysis in this section draws ideas from Srinivas et al. (2010). We will try our best to stick to their notation. However, unlike them we also handle the case where the acquisition function is optimised within some error. In the ensuing discussion, we will use $\hat{x}_t = \bigcup_j \hat{x}_t^{(j)}$ to denote the true maximiser of $\tilde{\varphi}_t$, i.e. $\hat{x}_t^{(j)} = \operatorname{argmax}_{z \in \mathcal{X}^{(j)}} \tilde{\varphi}_t^{(j)}(z)$. $x_t = \bigcup_j x_t^{(j)}$ denotes the point chosen by Add-GP-UCB at the $t$-th iteration. Recall that $x_t$ is $\zeta_0 t^{-1/2}$-optimal, i.e.

$$\tilde{\varphi}_t(\hat{x}_t) - \tilde{\varphi}_t(x_t) \le \zeta_0 t^{-1/2}.$$

Denote $p = \sum_j d_j$. $\pi_t$ denotes a sequence such that $\sum_t \pi_t^{-1} = 1$. For example, when we use $\pi_t = \pi^2 t^2/6$ below, we obtain the rates in Theorem 5.
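
The role of the sequence $\pi_t$ is to split a total failure probability $\delta$ across all time steps, with step $t$ receiving $\delta/\pi_t$. A small Python sketch (with hypothetical function names) confirms numerically that the choice $\pi_t = \pi^2 t^2/6$ has reciprocals summing to one, which is what makes the union bound over $t$ work.

```python
import math

def pi_seq(t):
    """The sequence pi_t = pi^2 t^2 / 6."""
    return math.pi ** 2 * t ** 2 / 6

def reciprocal_partial_sum(n):
    """Partial sum of 1 / pi_t for t = 1..n; converges to 1 from below."""
    return sum(1.0 / pi_seq(t) for t in range(1, n + 1))
```

The convergence is slow (the tail after $n$ terms is roughly $6/(\pi^2 n)$), which is harmless here since only the infinite sum matters in the analysis.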

In what follows, we will construct discretisations $\Omega_t^{(j)}$ on each group $\mathcal{X}^{(j)}$ for the sake of analysis. Let $\omega_{tj} = |\Omega_t^{(j)}|$ and $\omega_{tm} = \max_j \omega_{tj}$. The discretisation of the individual groups induces a discretisation $\Omega_t$ on $\mathcal{X}$ itself, $\Omega_t = \{x = \bigcup_j x^{(j)} : x^{(j)} \in \Omega_t^{(j)},\ j = 1, \dots, M\}$. Let $\omega_t = |\Omega_t| = \prod_j \omega_{tj}$. We first establish the following two lemmas before we prove Theorem 5.

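Lemma 10. Pick $\delta \in (0, 1)$ and set $\beta_t = 2\log(\omega_{tm} M \pi_t/\delta)$. Then with probability $> 1 - \delta$,

$$\forall t \ge 1,\ \forall x \in \Omega_t, \quad |f(x) - \mu_{t-1}(x)| \le \beta_t^{1/2} \sum_{j=1}^{M} \sigma_{t-1}^{(j)}(x^{(j)}).$$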

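Proof. Conditioned on $\mathcal{D}_{t-1}$, at any given $x$ and $t$ we have $f^{(j)}(x^{(j)}) \sim \mathcal{N}\left(\mu_{t-1}^{(j)}(x^{(j)}), \sigma_{t-1}^{(j)}(x^{(j)})^2\right)$, $\forall j = 1, \dots, M$. Using the tail bound $P(z > c) \le \frac{1}{2} e^{-c^2/2}$ for $z \sim \mathcal{N}(0, 1)$, applied to both tails, we have

$$P\left( \frac{|f^{(j)}(x^{(j)}) - \mu_{t-1}^{(j)}(x^{(j)})|}{\sigma_{t-1}^{(j)}(x^{(j)})} > \beta_t^{1/2} \right) \le e^{-\beta_t/2} = \frac{\delta}{\omega_{tm} M \pi_t}.$$

By using a union bound $\omega_{tj} \le \omega_{tm}$ times over all $x^{(j)} \in \Omega_t^{(j)}$ and then $M$ times over all discretisations, the complementary event holds with probability $> 1 - \delta/\pi_t$ for all $j = 1, \dots, M$ and $x^{(j)} \in \Omega_t^{(j)}$. Therefore, we have $|f(x) - \mu_{t-1}(x)| \le \sum_j |f^{(j)}(x^{(j)}) - \mu_{t-1}^{(j)}(x^{(j)})| \le \beta_t^{1/2} \sum_j \sigma_{t-1}^{(j)}(x^{(j)})$ for all $x \in \Omega_t$. Now using the union bound on all $t$ yields the result.

□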
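
The two ingredients of this proof can be checked numerically with a short Python sketch: the Gaussian tail bound $P(z > c) \le \frac{1}{2}e^{-c^2/2}$, and the fact that the choice of $\beta_t$ in Lemma 10 makes $e^{-\beta_t/2}$ equal to $\delta/(\omega_{tm} M \pi_t)$. The function names and the example values of $\omega_{tm}$, $M$, $t$ and $\delta$ are hypothetical.

```python
import math

def gaussian_tail(c):
    """Exact P(z > c) for z ~ N(0, 1), via the complementary error function."""
    return 0.5 * math.erfc(c / math.sqrt(2))

def tail_bound(c):
    """The bound 0.5 * exp(-c^2 / 2), valid for c >= 0."""
    return 0.5 * math.exp(-c * c / 2)

def beta_lemma10(omega_tm, M, t, delta):
    """beta_t = 2 log(omega_tm * M * pi_t / delta) with pi_t = pi^2 t^2 / 6."""
    pi_t = math.pi ** 2 * t ** 2 / 6
    return 2 * math.log(omega_tm * M * pi_t / delta)
```

Applying `tail_bound` to both tails with $c = \beta_t^{1/2}$ gives a per-point failure probability of at most $e^{-\beta_t/2}$, which is then summed over at most $\omega_{tm}$ points and $M$ groups.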

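Lemma 11. The posterior mean $\mu_{t-1}$ for a GP whose kernel $\kappa(\cdot, x)$ is $L$-Lipschitz satisfies,

$$P\left( \forall t \ge 1, \quad |\mu_{t-1}(x) - \mu_{t-1}(x')| \le \left( f(x_*) + \eta\sqrt{2\log(\pi_t/2\delta)} \right) L \eta^{-2}\, t\, \|x - x'\|_2 \right) \ge 1 - \delta.$$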

Proof. Note that for given $t$,

$$P\left( y_t > f(x_*) + \eta\sqrt{2\log(\pi_t/2\delta)} \right) \le P\left( \epsilon_t/\eta > \sqrt{2\log(\pi_t/2\delta)} \right) \le \delta/\pi_t.$$

Therefore the statement is true with probability $> 1 - \delta$ for all $t$. Further, $\Delta \succeq \eta^2 I$ implies $\|\Delta^{-1}\|_{op} \le \eta^{-2}$ and $|\kappa(x, z) - \kappa(x', z)| \le L\|x - x'\|$. Therefore

$$\begin{aligned}
|\mu_{t-1}(x) - \mu_{t-1}(x')| &= \left| Y_{t-1}^\top \Delta^{-1} \left( \kappa(x, X_{t-1}) - \kappa(x', X_{t-1}) \right) \right| \le \|Y_{t-1}\|_2\, \|\Delta^{-1}\|_{op}\, \|\kappa(x, X_{t-1}) - \kappa(x', X_{t-1})\|_2 \\
&\le \left( f(x_*) + \eta\sqrt{2\log(\pi_t/2\delta)} \right) L \eta^{-2} (t - 1) \|x - x'\|_2.
\end{aligned}$$

□

B.2.1. Proof of Theorem 5

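Proof. First note that by Assumption 2 and the union bound we have $P\left( \exists i,\ \sup_{x^{(j)} \in \mathcal{X}^{(j)}} |\partial f^{(j)}(x^{(j)})/\partial x_i^{(j)}| > J \right) \le d_j a e^{-(J/b)^2}$. Since $\partial f(x)/\partial x_i^{(j)} = \partial f^{(j)}(x^{(j)})/\partial x_i^{(j)}$, we have,

$$P\left( \exists i \in \{1, \dots, D\},\ \sup_{x \in \mathcal{X}} \left| \frac{\partial f(x)}{\partial x_i} \right| > J \right) \le p a e^{-(J/b)^2}.$$

By setting $\delta/3 = p a e^{-J^2/b^2}$ we have with probability $> 1 - \delta/3$,

$$\forall x, x' \in \mathcal{X}, \quad |f(x) - f(x')| \le b\sqrt{\log(3ap/\delta)}\, \|x - x'\|_1. \qquad (10)$$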


Now, we construct a sequence of discretisations $\Omega_t^{(j)}$ satisfying $\|x^{(j)} - [x^{(j)}]_t\|_1 \le d_j/\tau_t$ $\forall x^{(j)} \in \mathcal{X}^{(j)}$. Here, $[x^{(j)}]_t$ is the closest point to $x^{(j)}$ in $\Omega_t^{(j)}$ in an $L_2$ sense. A sufficient discretisation is a grid with $\tau_t$ uniformly spaced points. Then it follows that for all $x \in \mathcal{X}$, $\|x - [x]_t\|_1 \le p/\tau_t$. Here $\Omega_t$ is the discretisation induced on $\mathcal{X}$ by the $\Omega_t^{(j)}$'s and $[x]_t$ is the closest point to $x$ in $\Omega_t$. Note that $\|x^{(j)} - [x^{(j)}]_t\|_2 \le \sqrt{d_j}/\tau_t$ $\forall x^{(j)} \in \mathcal{X}^{(j)}$ and $\|x - [x]_t\|_2 \le \sqrt{p}/\tau_t$. We will set $\tau_t = pt^3$; therefore, $\omega_{tj} \le (pt^3)^d =: \omega_{tm}$. When combining this with (10), we get that with probability $> 1 - \delta/3$, $|f(x) - f([x]_t)| \le b\sqrt{\log(3ap/\delta)}/t^3$. By our choice of $\beta_t$ and using Lemma 10 the following is true for all $t \ge 1$ and for all $x \in \mathcal{X}$ with probability $> 1 - 2\delta/3$,

$$|f(x) - \mu_{t-1}([x]_t)| \le |f(x) - f([x]_t)| + |f([x]_t) - \mu_{t-1}([x]_t)| \le \frac{b\sqrt{\log(3ap/\delta)}}{t^3} + \beta_t^{1/2} \sum_{j=1}^{M} \sigma_{t-1}^{(j)}([x^{(j)}]_t). \qquad (11)$$


By Lemma 11 with probability $> 1 - \delta/3$ we have,

$$\forall x \in \mathcal{X}, \quad |\mu_{t-1}(x) - \mu_{t-1}([x]_t)| \le \frac{L\left( f(x_*) + \eta\sqrt{2\log(3\pi_t/2\delta)} \right)}{\sqrt{p}\, \eta^2 t^2}. \qquad (12)$$


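We use the above results to obtain the following bound on the instantaneous regret $r_t$ which holds with probability $> 1 - \delta$ for all $t \ge 1$,

$$\begin{aligned}
r_t &= f(x_*) - f(x_t) \\
&\le \mu_{t-1}([x_*]_t) + \beta_t^{1/2} \sum_{j=1}^{M} \sigma_{t-1}^{(j)}([x_*^{(j)}]_t) - \mu_{t-1}([x_t]_t) + \beta_t^{1/2} \sum_{j=1}^{M} \sigma_{t-1}^{(j)}([x_t^{(j)}]_t) + \frac{2b\sqrt{\log(3ap/\delta)}}{t^3} \\
&\le \frac{2b\sqrt{\log(3ap/\delta)}}{t^3} + \frac{\zeta_0}{\sqrt{t}} + \beta_t^{1/2} \left( \sum_{j=1}^{M} \sigma_{t-1}^{(j)}(x_t^{(j)}) + \sum_{j=1}^{M} \sigma_{t-1}^{(j)}([x_t^{(j)}]_t) \right) + \mu_{t-1}(x_t) - \mu_{t-1}([x_t]_t) \\
&\le \frac{2b\sqrt{\log(3ap/\delta)}}{t^3} + \frac{L\left( f(x_*) + \eta\sqrt{2\log(3\pi_t/2\delta)} \right)}{\sqrt{p}\, \eta^2 t^2} + \frac{\zeta_0}{\sqrt{t}} + \beta_t^{1/2} \left( \sum_{j=1}^{M} \sigma_{t-1}^{(j)}(x_t^{(j)}) + \sum_{j=1}^{M} \sigma_{t-1}^{(j)}([x_t^{(j)}]_t) \right). \qquad (13)
\end{aligned}$$

In the first step we have applied Equation (11) at $x_*$ and $x_t$. In the second step we have used the fact that $\tilde{\varphi}_t([x_*]_t) \le \tilde{\varphi}_t(\hat{x}_t) \le \tilde{\varphi}_t(x_t) + \zeta_0 t^{-1/2}$, where $\hat{x}_t$ is the exact maximiser of $\tilde{\varphi}_t$. In the third step we have used Equation (12).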

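For any $x \in \mathcal{X}$ we can bound $\sigma_t(x)^2$ as follows,

$$\sigma_t(x)^2 = \eta^2 \eta^{-2} \sigma_t(x)^2 \le \frac{1}{\log(1 + \eta^{-2})} \log\left( 1 + \eta^{-2} \sigma_t(x)^2 \right).$$

Here we have used the fact that $u^2 \le v^2 \log(1 + u^2)/\log(1 + v^2)$ for $u \le v$ and $\sigma_t(x)^2 \le \kappa(x, x) = 1$. Write $C_1 = \log^{-1}(1 + \eta^{-2})$. By using Jensen's inequality and Definition 3 for any set of $T$ points $\{x_1, x_2, \dots, x_T\} \subset \mathcal{X}$,

$$\left( \sum_{t=1}^{T} \sum_{j=1}^{M} \sigma_t^{(j)}(x_t^{(j)}) \right)^2 \le MT \sum_{t=1}^{T} \sum_{j=1}^{M} \sigma_t^{(j)}(x_t^{(j)})^2 \le C_1 MT \sum_{t=1}^{T} \log\left( 1 + \eta^{-2} \sigma_t(x_t)^2 \right) \le 2 C_1 M T \gamma_T. \qquad (14)$$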
























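Finally we can bound the cumulative regret with probability $> 1 - \delta$ for all $T \ge 1$ by,

$$\begin{aligned}
R_T = \sum_{t=1}^{T} r_t &\le C_2(a, b, D, L, \delta) + \zeta_0 \sum_{t=1}^{T} t^{-1/2} + \beta_T^{1/2} \left( \sum_{t=1}^{T} \sum_{j=1}^{M} \sigma_{t-1}^{(j)}(x_t^{(j)}) + \sum_{t=1}^{T} \sum_{j=1}^{M} \sigma_{t-1}^{(j)}([x_t^{(j)}]_t) \right) \\
&\le C_2(a, b, D, L, \delta) + 2\zeta_0\sqrt{T} + \sqrt{8 C_1 \beta_T M T \gamma_T},
\end{aligned}$$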

where we have used the summability of the first two terms in Equation (13). Here, for $\delta < 0.8$, the constant $C_2$ is given by,

$$C_2 \ge b\sqrt{\log(3ap/\delta)} + \frac{\pi^2 L f(x_*)}{6\sqrt{p}\, \eta^2} + \frac{L \pi^{3/2}}{\sqrt{12 p \delta}\, \eta}.$$

□








